Abstract

Background: The Hedgehog Signaling Pathway is one of the signaling pathways that are essential to embryonic
development. Inhibitors of the Hedgehog Signaling Pathway can control cell growth and death, and novel
inhibitors of the pathway are in great demand. Effective inhibitors could provide efficient therapies for a
wide range of malignancies, and targeting this pathway in cells represents a promising new paradigm for cell
growth and death control. Current research mainly focuses on the synthesis of cyclopamine-derivative
inhibitors, which bind specifically to the Smo protein and can be used for cancer therapy. While quantitative
structure-activity relationship (QSAR) studies have been performed for these compounds in different cell lines,
none of them has achieved acceptable results in predicting the activity values of new compounds. In this study,
we propose a novel collaborative QSAR model for inhibitors of the Hedgehog Signaling Pathway that integrates
information from multiple cell lines. Such a model is expected to substantially improve on QSAR modeling from
single cell lines, and to provide useful clues for developing clinically effective inhibitors and
modifications of parent lead compounds that target the Hedgehog Signaling Pathway.

Results: In this study, we present: (1) a collaborative QSAR model, which integrates information across
multiple cell lines to boost the QSAR results, rather than modeling a single cell line only. Our experiments
show that the performance of our model is significantly better than that of single cell line QSAR methods;
and (2) an efficient feature selection strategy for this collaborative setting, which derives the features
that are commonly important across all the given cell lines while simultaneously showing their specific
contributions to each cell line. Based on the feature selection results, we propose several possible chemical
modifications to improve the inhibitor affinity towards multiple targets in the Hedgehog Signaling Pathway.

Conclusions: Our model with the feature selection strategy presented here is efficient, robust, and flexible, and can 
be easily extended to model large-scale multiple cell line/QSAR data. The data and scripts for collaborative QSAR 
modeling are available in the Additional file 1. 



Background 

The Hedgehog Signaling Pathway plays an important role
in regulating embryonic development in vertebrates, and it
is highly conserved from flies to humans [1] [2] [3] [4]. The
pathway name comes from a polypeptide ligand called
Hedgehog (Hh), which is an intercellular signaling molecule
in Drosophila. In Drosophila, mutation of the gene in
the Hedgehog Signaling Pathway gives rise to an unusual
spiky-haired phenotype [1]. Misregulation of this
pathway has been directly associated with a variety of
inherited and sporadic diseases [4] [5] [6]. The key role of
the Hedgehog Signaling Pathway in cell differentiation,
growth, and proliferation makes it an excellent candidate in
drug discovery, and thus targeting this pathway in cells
represents a promising new paradigm for cell growth and
death control.

The Hedgehog Signaling Pathway is composed of four
important components: Sonic Hedgehog, Patched,
Smoothened and Gli transcription factors [3] (Figure 1).
The functional Hh protein is secreted from the membranes
of the producing cells and initiates the Hh signaling cascade
upon binding to the 12-pass transmembrane receptor
Patched (Ptch). In the absence of an Hh ligand, the
Patched receptor inhibits the activity of the downstream
seven-pass transmembrane receptor Smoothened (Smo),
which resembles G-protein-coupled receptors (GPCRs)
in general topology. Binding of Hh to Patched relieves this
inhibition. Active Smo then signals via a cytosolic
complex of proteins including Suppressor of Fused
(SuFu), and the cascade culminates by triggering activation
of the glioma (Gli) family of transcription factors
and their translocation to the nucleus. This activation
results in the expression of specific genes that promote
cell proliferation and differentiation [3].

Figure 1 Components of the Hedgehog (Hh) Signaling Pathway
and molecular sites targeted by Hh pathway inhibitors.

The causal relationship between activation of the
Hedgehog Signaling Pathway and oncogenesis has driven
cancer researchers to search for specific
inhibitors of hedgehog signaling, since these could provide
efficient therapies for a wide range of malignancies [1,2].
To date, several druggable nodes within the pathway
have been identified. Assays implemented in various cell
lines have shown that small molecules are able to alter
the activity of these targets. Among them, murine cell lines
such as NIH 3T3, TM3hl2, and C3H10T1/2 have been
used [2]. While current cell lines allow the inhibitory
effects of compounds on the Hh pathway to be measured,
they provide little or no information about the
specific underlying targets. To the best of our knowledge,
only specific Smoothened inhibitors have been identified.
Among them, the well-known BODIPY-cyclopamine,
a fluorescent derivative of the naturally occurring
Smo antagonist cyclopamine, binds specifically to cells
expressing the Smo protein. It is one of the small chemical
compounds that specifically inhibit Smoothened in the
Hedgehog Signaling Pathway [2]. In our previous study [7],
we performed several quantitative structure-activity
relationship (QSAR) studies for cyclopamine derivatives in
multiple cell lines; such studies could reveal useful clues
for developing clinically effective drugs and modifications of
parent lead compounds for cancer therapy.

Recently, our partners synthesized 93 cyclopamine
derivatives and tested their activities against
four different cell lines (BxPC-3, NCI-H446, SW1990
and NCI-H157) [7] [8]. Based on these experimental
data, a systematic QSAR investigation was
carried out using various statistical models
and different molecular descriptors [7]. However,
several issues remain to be solved. We believe that
solving them will greatly enhance the understanding of
inhibitors of the Hedgehog Signaling Pathway, as well as
the development of novel QSAR methodology.
We describe the two major problems below:

(1) In our previous QSAR study, the activities for specific
cell lines were categorized into two classes
under a naive Bayesian model, and we obtained relatively
acceptable QSAR results. However, no matter which
statistical models or 2D descriptors were tested,
low testing correlation coefficients were found when numeric
activities were used. This may be due to the inherent
noise in experimental activity measurements,
or to the relatively small number of training data available for a
specific cell line. Since our compounds were tested against
multiple cell lines, we hypothesize
that such information can be integrated to improve the
QSAR results beyond what single cell line QSAR
modeling achieves. Such an investigation will be extremely useful for
scenarios in which a small number of compound activities
are measured under different experimental conditions
(such as different cell lines, targets, assays etc.), and will
provide novel insights into the integration of existing
information, avoiding repetitive laborious work in drug
discovery. In addition, such a study may also lead to novel
computational models for integrated QSAR modeling,
which is closely related to multi-task QSAR modeling [9]
[10], Multi-Assay-Based QSAR modeling [11], and
Multi-target QSAR study [12].

(2) Given compound activity data
against multiple cell lines, how can we integrate such
information to derive more robust and efficient feature
selection strategies for compound modification in such a
"collaborative" multi-cell line environment? That is, can
we derive the features that are commonly important across
all the given cell lines for compound description,
while at the same time presenting their specific
contributions to each cell line? This issue is closely related
to the first one, but is harder to solve since it requires
much more domain knowledge.

Inspired by these two problems, we aim to develop an
efficient integrated QSAR model for inhibitors of the Hedgehog
Signaling Pathway against multiple cell lines. This type of
model has been used for information retrieval in social
networks, i.e., collaborative filtering [13] [14], and it has
been widely applied by web companies such as Google and
Amazon. Dumitru Erhan et al. pioneered the use of the
term "collaborative filtering" in multiple target studies [15].
Nevertheless, their methodology can be categorized as
multiple regression or neural network, and it needs a complex
kernel function for similarity measurement. In this
study, we present a collaborative filtering model based on
collective matrix factorization for integrated QSAR
modeling, which is more naturally suited to QSAR
modeling and scales well to large datasets. Furthermore,
we also derive a powerful feature selection strategy
for collaborative compound design, to obtain more
efficient inhibitors of the Hedgehog Signaling Pathway.

Methods and materials 

Dataset 

93 cyclopamine derivatives with their activities against
four different cell lines (BxPC-3, NCI-H446, SW1990
and NCI-H157) were obtained from our previous work
[7]. The compound activity is measured by pKi, which is
defined through the following Cheng-Prusoff equation [16]:

$$K_i = \frac{IC_{50}}{1 + [L]/K_D} \qquad (1)$$

where $[L]$ is the concentration of free radioligand used,
$K_D$ is its equilibrium dissociation constant for the receptor
[16], and $IC_{50}$ (half maximal inhibitory concentration) is
a measure of the effectiveness of a compound in inhibiting
a biological or biochemical function. More specifically,
it indicates how much of a particular drug or other
substance (inhibitor) is needed to inhibit a given biological
process (or component of a process, e.g., an enzyme, cell,
cell receptor or microorganism) by half. In our study,
the data are formulated as a data matrix $X$. Note that
collective matrix factorization requires the matrix to
be non-negative. In our original experiments, we measured
the compound affinity under the pKi evaluation
system, and the activity values were negative. Since
pKi is calculated by taking $IC_{50}$ as the input
to equation (1), we can simply take the absolute value
of pKi in QSAR modeling, and this will not affect
our final results.
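
As an illustration of the conversion described above, the following is a minimal Python sketch. It assumes the standard Cheng-Prusoff form Ki = IC50 / (1 + [L]/KD) and assumes that the log-scale affinity is log10(Ki) with Ki in molar units, which is negative for sub-molar Ki; the function names and the numeric inputs are hypothetical and are not taken from the dataset.

```python
import math

def cheng_prusoff_ki(ic50, ligand_conc, kd):
    """Cheng-Prusoff: Ki = IC50 / (1 + [L]/KD), all in the same concentration unit."""
    return ic50 / (1.0 + ligand_conc / kd)

def non_negative_activity(ki_molar):
    """Absolute value of log10(Ki); assumes Ki is given in molar units."""
    return abs(math.log10(ki_molar))

# Hypothetical values, for illustration only (molar units).
ki = cheng_prusoff_ki(ic50=2e-6, ligand_conc=1e-9, kd=1e-9)
activity = non_negative_activity(ki)
print(ki, activity)
```

Taking the absolute value only flips the sign of every entry, so the ranking of compounds by potency is reversed consistently and the matrix becomes non-negative as the factorization requires.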

Definitions and Notations 

In this paper, the different cell lines and the compounds
tested for the Hedgehog Signaling Pathway are denoted as $t$
and $c$ respectively, and their subscripts
denote a specific cell line and compound. Thus, for a
specific compound $c_i$, its experimental activity value
(measured as pKi) against a specific cell line $t_j$ is denoted
as $X_{ij}$. We can build an $m \times n$ matrix $X$,
where $m$ is the number of compounds and $n$ is the
number of cell lines.

Each compound will be represented by a vector of 
descriptors, denoted as a matrix Y with m by r dimen- 
sions, where m is the number of compounds and r is the 
length of the corresponding descriptor vector. Similar to 
our previous study [7], two different molecule descrip- 
tors, general descriptor [17] and drug-like index (DLI) 
[18] will be used for compound representation. 

Collaborative filtering for multiple cell line QSAR 
modeling 

Based on the above definitions, traditional
single cell line QSAR modeling is applied to
the data in a specific column of matrix X. In this study,
we are more interested in incorporating the information
from other columns (cell lines) to enhance the performance
of QSAR modeling for a particular column
(cell line). This scenario is similar to the recommendation
systems used by electronic retailers and content
providers such as Amazon.com and Netflix [14], which
make automatic predictions (filtering) of users' interests
by collecting preferences or taste information from
many users (collaborating), and which are naturally termed
"collaborative filtering (CF)".

Formally speaking, in a typical CF scenario, there is
a list of $n$ users $\{u_1, u_2, \ldots, u_n\}$ and a list of $m$
items $\{i_1, i_2, \ldots, i_m\}$, and each user $u_i$ has a list of
items, $I_{u_i}$, which the user has rated, or about which
their preferences have been inferred through their
behaviors. The ratings can either be explicit indications
on a 1-5 scale, or implicit indications such as purchases
or click-throughs [13]. Such a user-item relationship
can be formulated as a matrix, which may be
sparse and can have missing values (i.e., users did not
give their preferences). The goal of CF is to predict
such missing values based on the existing information
about users/items in order to make reasonable recommendations
(Left Panel of Figure 2).

Such a CF scenario is inherently suitable for our multiple
cell line QSAR modeling. In our study, the
"compound-cell line" matrix X defined above can be viewed as a kind
of "item-user" matrix, where "compound" is analogous to
"item" and "cell line" is analogous to "user" (Right Panel
of Figure 2). Traditional single cell line QSAR modeling
uses only the data in a specific column of
matrix X for training and testing. From the perspective of
machine learning, we hold out part of the data in the column
as the testing dataset, and use the other part of the data in the
column to train a QSAR model. This procedure extends
naturally to multiple cell line QSAR modeling
under the CF framework, where we treat the testing
data in a specific column as "missing" values and use the
remaining data from this column as well as the data from
the other columns to predict such missing values.

Figure 2 A parallel comparison between collaborative QSAR modeling and collaborative filtering in the social network community.
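
The hold-out step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' script: the function name `mask_column`, the random seed, and the synthetic activity values are assumptions; only the matrix shape (93 compounds by 4 cell lines) and the 1/3 test fraction follow the text.

```python
import numpy as np

def mask_column(X, col, test_fraction=1/3, seed=0):
    """Mark a random subset of one column as 'missing' (test) entries.

    Returns an indicator matrix I (1 = observed, 0 = held out) and the
    sorted row indices that were held out in the chosen column.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_test = int(round(m * test_fraction))
    test_rows = rng.choice(m, size=n_test, replace=False)
    I = np.ones_like(X, dtype=float)
    I[test_rows, col] = 0.0
    return I, np.sort(test_rows)

# Synthetic stand-in for the 93 x 4 compound-by-cell-line activity matrix.
rng = np.random.default_rng(1)
X_demo = rng.uniform(4.0, 8.0, size=(93, 4))
I_demo, test_rows = mask_column(X_demo, col=1)
print(I_demo.sum(axis=0))  # observed entries per cell line
```

Only the chosen column loses entries; the other three columns stay fully observed and supply the extra information that the collaborative model can exploit.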

Collective matrix factorization for collaborative filtering in 
the multiple cell line QSAR modeling 

We formulate the multiple cell line QSAR modeling prob- 
lem as a collaborative filtering problem. There are two 
existing techniques for solving collaborative filtering, i.e., 
the neighborhood methods and latent factor models. 
Neighborhood methods are centered on computing the 
relationships between items or, alternatively, between users 
for missing value prediction, while latent factor models 
characterize both items and users on, say, 20 to 100 factors 
inferred from the ratings patterns [19] [20] [21]. Generally, 
realizations of latent factor models are based on matrix 
factorization. In its basic form, matrix factorization charac- 
terizes both items and users through vectors of factors in- 
ferred from item rating patterns. High correlation between 
item and user factors leads to a recommendation. These 
methods have become popular in recent years by combi- 
ning good scalability with predictive accuracy. Thus, we will 
present a matrix factorization based multiple cell line 
QSAR modeling method in our study. 

Specifically, we have a matrix $X \in \mathbb{R}_+^{m \times n}$, where $X_{ij}$
represents the activity measurement of compound $i$
against a specific cell line $j$. Note that $X$ is sparse in a
specific column, since we hold out part of the elements in
this column as the testing data (missing values) for
QSAR modeling. We use an indicator matrix $I \in \mathbb{R}^{m \times n}$
to represent the missing values, where $I_{ij} = 0$ if $X_{ij}$ is
missing and $I_{ij} = 1$ otherwise.

We denote by $X_{i\cdot}$, $1 \le i \le m$, and $X_{\cdot j}$, $1 \le j \le n$, the $i$th row and
$j$th column of $X$, which represent the $i$th compound's
activities against all the cell lines and the activities of the
$j$th cell line for all the compounds, respectively.

In a basic matrix factorization model, we usually seek
two low-rank matrices, $U \in \mathbb{R}_+^{m \times d}$ and $V \in \mathbb{R}_+^{n \times d}$. The
row vectors $U_i$ and $V_j$ are the low-dimensional
representations of compounds and cell lines respectively.
We use the matrix $UV^T$ to approximate the original
matrix $X$, and thus to fill/predict the missing values. Such a
matrix factorization can be achieved by solving the following
optimization problem:



$$\min_{U,V} L(U,V) \qquad (2)$$

where

$$L(U,V) = \|I \circ (X - UV^T)\|_F^2 + \lambda_1 \|U\|_F^2 + \lambda_2 \|V\|_F^2 \qquad (3)$$

In equation (3), the operator "$\circ$" denotes the entry-wise
product and $\|\cdot\|_F$ denotes the Frobenius norm. The
last two terms regularize the matrices $U$ and $V$
to avoid over-fitting the observed data. The parameters
$\lambda_1$ and $\lambda_2$ control the extent of regularization,
and they are usually determined by cross-validation.
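
To make the role of the indicator matrix concrete, the sketch below evaluates a masked factorization objective on a tiny hand-made example. It assumes the objective is the masked squared Frobenius error plus one Frobenius penalty each on U and V, weighted by λ1 and λ2; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def masked_mf_loss(X, I, U, V, lam1, lam2):
    """||I o (X - U V^T)||_F^2 + lam1 * ||U||_F^2 + lam2 * ||V||_F^2."""
    residual = I * (X - U @ V.T)  # held-out entries contribute nothing
    return float(np.sum(residual**2) + lam1 * np.sum(U**2) + lam2 * np.sum(V**2))

# Toy data: entry (0, 1) is held out, so its error is ignored.
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
I_toy = np.array([[1.0, 0.0], [1.0, 1.0]])
U_toy = np.ones((2, 1))
V_toy = np.ones((2, 1))
loss_toy = masked_mf_loss(X_toy, I_toy, U_toy, V_toy, lam1=0.5, lam2=0.25)
print(loss_toy)
```

In this toy case the observed residuals are 0, 2 and 3, so the data term is 13 and the two penalties add 1.0 and 0.5.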

We also have the compound description information
in matrix $Y$, and we use this auxiliary
information to aid a more reasonable reconstruction of
matrix $X$. We therefore present a collective matrix
factorization (CMF) [22] method for multiple cell line
QSAR modeling. The CMF method was recently presented
by the machine learning community [22], and it
factorizes multiple matrices simultaneously,
assuming that they share several common latent factors.
To be more specific, given a compound-cell line
matrix $X \in \mathbb{R}_+^{m \times n}$ and a compound description matrix
$Y \in \mathbb{R}_+^{m \times r}$, we extend the optimization problem (2) and
(3) to the following:

$$\min_{U,V,W} L(U,V,W) \qquad (4)$$

where

$$L(U,V,W) = \frac{1}{2}\|I \circ (X - UV^T)\|_F^2 + \frac{\lambda_1}{2}\|Y - UW^T\|_F^2 + \frac{\lambda_2}{2}\left(\|U\|_F^2 + \|V\|_F^2 + \|W\|_F^2\right) \qquad (5)$$

Equation (5) is similar to equation (3): it reconstructs
$X \approx UV^T$ and $Y \approx UW^T$ by sharing the common factor
$U$, where $X \in \mathbb{R}_+^{m \times n}$, $Y \in \mathbb{R}_+^{m \times r}$, $U \in \mathbb{R}_+^{m \times d}$, $V \in \mathbb{R}_+^{n \times d}$ and
$W \in \mathbb{R}_+^{r \times d}$. $U$, $V$ and $W$ are low-dimensional matrices with
dimension $d < \min(m, r)$. By solving this optimization
problem, we can incorporate the information
of both the multiple cell line compound activities
and the compound descriptions for a better missing value
prediction.

In general, the objective function (5) is not jointly convex
in all the variables $U$, $V$, $W$, and we cannot obtain
closed-form solutions for minimizing it.
Therefore, we turn to a numerical method
such as gradient descent to obtain locally optimal
solutions. Specifically, the gradients are:

$$\nabla_U L = [I \circ (UV^T - X)]V + \lambda_1 (UW^T - Y)W + \lambda_2 U \qquad (6)$$

$$\nabla_V L = [I \circ (UV^T - X)]^T U + \lambda_2 V \qquad (7)$$

$$\nabla_W L = \lambda_1 (UW^T - Y)^T U + \lambda_2 W \qquad (8)$$

After obtaining the gradients, we can use gradient
descent to iteratively minimize the objective function.


The algorithm for the collective factorization of the two
matrices is given below:

Algorithm 1: collective matrix factorization for multiple
cell line QSAR modeling

Input: An incomplete matrix $X$ and a complete matrix
$Y$, where $X$ represents the compound activities in
multiple cell lines with missing values in a specific
column, and $Y$ represents the compound description matrix.
Output: The completed matrix for $X$.

Begin

1. $t = 1$;

2. While ($t < T$ and $L_t - L_{t+1} > \epsilon$) do

3. Get the gradients $\nabla_U L$, $\nabla_V L$, $\nabla_W L$ by Equations (6)-(8);

4. $\gamma = 1$;

5. While ($L(U_t - \gamma \nabla_{U_t} L,\ V_t - \gamma \nabla_{V_t} L,\ W_t - \gamma \nabla_{W_t} L) > L(U_t, V_t, W_t)$) do

6. $\gamma = \gamma / 2$;

7. End

8. $U_{t+1} = U_t - \gamma \nabla_{U_t} L$, $V_{t+1} = V_t - \gamma \nabla_{V_t} L$, $W_{t+1} = W_t - \gamma \nabla_{W_t} L$;

9. $t = t + 1$;

10. End

11. return $X$;

End
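
Algorithm 1 can be sketched in Python with NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' released script: it assumes the objective of equation (5) with 1/2 factors, returns $UV^T$ as the completed matrix, adds a lower bound on the step size to guarantee termination of the inner loop, and does not enforce non-negativity of the factors. The function names, default parameters and synthetic data are hypothetical.

```python
import numpy as np

def cmf_loss(X, Y, I, U, V, W, lam1, lam2):
    """Objective in the form of equation (5)."""
    rx = I * (X - U @ V.T)
    ry = Y - U @ W.T
    reg = np.sum(U**2) + np.sum(V**2) + np.sum(W**2)
    return 0.5 * np.sum(rx**2) + 0.5 * lam1 * np.sum(ry**2) + 0.5 * lam2 * reg

def cmf_grads(X, Y, I, U, V, W, lam1, lam2):
    """Gradients with respect to U, V and W, as in equations (6)-(8)."""
    ex = I * (U @ V.T - X)
    ey = U @ W.T - Y
    return (ex @ V + lam1 * ey @ W + lam2 * U,
            ex.T @ U + lam2 * V,
            lam1 * ey.T @ U + lam2 * W)

def cmf_fit(X, Y, I, d=2, lam1=1.0, lam2=0.01, T=500, eps=1e-8, seed=0):
    """Gradient descent with step halving; returns (U V^T, loss history)."""
    rng = np.random.default_rng(seed)
    X0 = np.where(I > 0, X, 0.0)  # held-out entries never enter the loss
    U = rng.uniform(0.1, 1.0, (X.shape[0], d))
    V = rng.uniform(0.1, 1.0, (X.shape[1], d))
    W = rng.uniform(0.1, 1.0, (Y.shape[1], d))
    history = [cmf_loss(X0, Y, I, U, V, W, lam1, lam2)]
    for _ in range(T):
        gU, gV, gW = cmf_grads(X0, Y, I, U, V, W, lam1, lam2)
        gamma = 1.0
        while gamma > 1e-12:  # halve the step until the loss does not increase
            Un, Vn, Wn = U - gamma * gU, V - gamma * gV, W - gamma * gW
            new = cmf_loss(X0, Y, I, Un, Vn, Wn, lam1, lam2)
            if new <= history[-1]:
                break
            gamma /= 2.0
        else:
            break  # no acceptable step size found; stop without updating
        U, V, W = Un, Vn, Wn
        history.append(new)
        if history[-2] - history[-1] <= eps:  # improvement below threshold
            break
    return U @ V.T, history

# Synthetic low-rank data (30 compounds, 4 cell lines, 6 descriptors).
rng = np.random.default_rng(42)
U_true = rng.uniform(0.5, 1.5, (30, 2))
X_demo = U_true @ rng.uniform(0.5, 1.5, (4, 2)).T
Y_demo = U_true @ rng.uniform(0.5, 1.5, (6, 2)).T
I_demo = np.ones_like(X_demo)
I_demo[:10, 0] = 0.0  # hold out ten activities in the first cell line
X_hat, history = cmf_fit(X_demo, Y_demo, I_demo)
print(history[0], history[-1])
```

The entries of `X_hat` at positions where `I_demo` is zero are the predictions for the held-out activities. Because every accepted step is required not to increase the loss, the recorded loss history is non-increasing.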

Performance measurement 

In order to demonstrate the efficiency of collective matrix
factorization based multiple cell line QSAR modeling, we
compare our approach with two baseline methods,
i.e., linear ridge regression and support vector regression
(SVR), which were used for single cell line QSAR modeling in our
previous study [7]. For a fair comparison, we
apply the following two testing strategies for each specific
cell line: (1) We randomly select 2/3 of the data to train
the linear ridge regression and SVR, and use the remaining data
to test these two methods. These two baseline methods
are compared with the collective matrix factorization based
QSAR method, where the same testing data (missing
values) are predicted based on the original training data for
this specific cell line plus the data from the other cell lines. The
whole procedure is repeated 10 times. (2) In order to
evaluate the QSAR model more rigorously and to consider
how representative the compounds in the training
dataset are, we apply another data partition strategy, i.e., the
Diverse Subset data division method [7], which is commonly
used in the chemoinformatics community. Generally
speaking, the Diverse Subset method ranks compound
entries based on diversity. In the procedure of data division,
the first entry of the original dataset is taken as a reference
and is always viewed as part of the diverse subset. Then
the compound most distinct from it is assigned #2,
the compound most distinct from these two is assigned #3,
and so on until the required number of diverse compounds
is identified or the whole dataset is ranked in diversity order
[7]. In this study, we again select 1/3 of the data as the testing
dataset and the remaining data as the training dataset, but
the partition is generated in a Diverse Subset way rather
than randomly, to keep the representative and distinct
characteristics of the data.
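
The diversity ranking described above can be sketched as a greedy max-min selection. The text does not specify the distance measure or how "most distinct to these two" is computed, so this sketch assumes Euclidean distance on descriptor vectors and picks, at each step, the compound whose smallest distance to the already ranked compounds is largest. The function name is hypothetical, and how the ranking is mapped to the 2/3 and 1/3 split is not shown because the text does not state it.

```python
import numpy as np

def diverse_ranking(D):
    """Rank rows of descriptor matrix D in diversity order (greedy max-min).

    Row 0 is the reference; each next pick maximizes the minimum Euclidean
    distance to all rows ranked so far.
    """
    D = np.asarray(D, dtype=float)
    order = [0]
    min_dist = np.linalg.norm(D - D[0], axis=1)
    min_dist[0] = -np.inf  # never pick an already ranked row again
    while len(order) < D.shape[0]:
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(D - D[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return order

# One-dimensional toy descriptors, for illustration only.
D_toy = np.array([[0.0], [1.0], [10.0], [5.0]])
order_toy = diverse_ranking(D_toy)
print(order_toy)
```

On the toy data the reference is row 0, the farthest row (value 10) is ranked second, the midpoint (value 5) third, and the row closest to the reference last.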

Two classical measurements, i.e., root mean squared
error (RMSE) and squared correlation coefficient
(R-square), were adopted to evaluate the
testing results. These statistical parameters
are defined as follows:

Root mean squared error (RMSE):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad (9)$$

where $n$ is the number of test compounds, $y_i - \hat{y}_i$
is the difference between the observed compound affinity
data and the fitted model, $y_i$ is the observed compound
affinity, and $\hat{y}_i$ is the predicted compound affinity.

Squared correlation coefficient (R2):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(P_{exp,i} - P_{calc,i})^2}{\sum_{i=1}^{n}(P_{exp,i} - P_{avg})^2} \qquad (10)$$

where $P_{exp}$ and $P_{calc}$ are the experimental and calculated compound
affinities, and $P_{avg}$ is the average value of $P_{exp}$ over the $n$
predicted compound affinities.
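
The two measures can be computed as in the following sketch. It assumes that R-square is taken in the coefficient-of-determination form, one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the experimental values; the function names and the four toy values are hypothetical.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error over the test compounds."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))

def r_square(p_exp, p_calc):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of p_exp."""
    p_exp, p_calc = np.asarray(p_exp, float), np.asarray(p_calc, float)
    ss_res = np.sum((p_exp - p_calc) ** 2)
    ss_tot = np.sum((p_exp - p_exp.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy observed and predicted affinities, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(rmse(obs, pred), r_square(obs, pred))
```

For the toy values the residual sum of squares is 0.5 and the total sum of squares is 5.0, so R-square is 0.9 and RMSE is the square root of 0.125.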

Feature selection based on CMF for compound 
description among multiple cell lines 

Under this collaborative QSAR scheme, we present a
novel feature selection model for weighting compound
descriptions, which is derived from content-based
recommender systems and collaborative filtering [23].
Basically, we want to quantify the effect of each compound
feature against a specific cell line (weighting for
intra-cell line) as well as among all the cell lines (weighting
for inter-cell line). The final feature weighting is an
integration of the two types of weighting, to which both
the specific cell line and the whole set of cell lines contribute.
Such a feature selection strategy is attractive in multiple cell line
QSAR modeling, since it can provide useful clues on
how to modify chemical compounds to improve their
activities for a specific target, or for all given cell lines
simultaneously. The latter is a key step in
multi-target compound design.

Specifically, given a compound activity-cell line matrix
$X$ ($m$ by $n$) and a compound-feature description matrix
$Y$ ($m$ by $r$), we want to derive a cell line-description
feature weighting matrix $Z$ ($n$ by $r$), whose element
$Z_{ij}$ is the weight of compound feature $j$ in cell line $i$.
The value of each element has two contributions,
i.e., intra-cell line and inter-cell line. The general
procedure for computing a weight for each compound
feature is based on (1) the amount of information provided
by the feature itself, and (2) the correlation between the
compound feature and a specific cell line. Three steps
are performed here:

Step 1. Weighting for inter-cell line. For each
compound feature $c_j$, an entropy based method is
applied to compute the amount of information that
it can offer regardless of cell line, denoted as $H_j$.

Step 2. Weighting for intra-cell line. For each
compound feature $c_j$ and a specific cell line $t_i$, the
correlation between the compound feature and the cell
line is calculated. This calculation depends on
the nature of the features (qualitative or quantitative).
Two kinds of correlations, i.e., the correlation coefficient
and the contingency coefficient [23], are proposed for
quantitative features and qualitative features
respectively.

Step 3. Calculation of the final weights. The feature
weight is obtained as the product of the
entropy and the degree of dependency.

A general outline of the proposed feature selection
strategy is presented in Figure 3. Detailed information
can be found in the original work [23].
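
The three steps can be sketched for quantitative features as follows. The exact entropy estimator and dependency measures are given in [23] and are not reproduced here, so this sketch makes several assumptions: the entropy of a feature is estimated from an equal-width histogram with a hypothetical number of bins, the dependency is the absolute Pearson correlation coefficient, qualitative features (contingency coefficient) are not handled, and all names and data are illustrative.

```python
import numpy as np

def feature_entropy(column, bins=5):
    """Shannon entropy (bits) of a feature, estimated from a histogram."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def feature_weights(X, Y, bins=5):
    """Z[i, j] = H_j * |corr(activity in cell line i, feature j)|."""
    n, r = X.shape[1], Y.shape[1]
    H = np.array([feature_entropy(Y[:, j], bins) for j in range(r)])  # Step 1
    Z = np.zeros((n, r))
    for i in range(n):
        for j in range(r):
            # Constant columns carry no information; leave their weight at 0.
            if np.std(X[:, i]) > 0 and np.std(Y[:, j]) > 0:
                dep = abs(np.corrcoef(X[:, i], Y[:, j])[0, 1])         # Step 2
                Z[i, j] = H[j] * dep                                   # Step 3
    return Z

# Synthetic example: feature 0 drives cell line 0, feature 1 is constant,
# feature 2 is noise.
rng = np.random.default_rng(0)
f0 = np.linspace(0.0, 1.0, 20)
Y_demo = np.column_stack([f0, np.full(20, 3.0), rng.normal(size=20)])
X_demo = np.column_stack([2.0 * f0 + 1.0, rng.normal(size=20)])
Z_demo = feature_weights(X_demo, Y_demo)
print(np.round(Z_demo, 3))
```

Each row of `Z_demo` corresponds to one cell line and each column to one compound feature, matching the n by r layout of the weighting matrix described above.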

Results 

We performed a comprehensive study of the collective
matrix factorization based multiple cell line QSAR modeling
for the inhibitors of the Hedgehog Signaling Pathway,
as described in the Methods and materials section. In the
rest of this section we present and summarize the key results
from this study. The performance of our method was compared
with the baseline QSAR models of linear ridge regression and
SVR. Details are given below.

Performance of the collective matrix factorization based 
QSAR modeling 

Figures 4 and 5 present the average improvements achieved by
the CMF based multiple cell line QSAR modeling over the
baseline methods for the four cell lines, with two different kinds
of compound representations, i.e., general descriptor and
drug-like index. Figure 4 shows the performance
under the first partition strategy, where the test was carried
out under a given parameter setting and repeated 10 times,
each time randomly selecting 2/3 of the data as the training dataset
and 1/3 as the testing dataset. Figure 5 shows the second
strategy, where the test was carried out under a given
parameter setting with the diverse subset partition, to account for
how representative the data in the training and testing datasets are.

Figure 3 Feature weighting based on entropy and dependency measures for collaborative QSAR modeling.



From Figures 4-5, it can be seen that the different
data partition strategies achieve similar performance
results. For all the cell lines and all the kinds
of data representation, the performance improvement of
collaborative QSAR modeling was dramatic, especially
as evaluated by R-square. The improvement is
statistically significant, with significant p-values for both
RMSE and R-square. We had already
noticed in our previous study [7] that, as measured
by R-square, the QSAR modeling results for
the four cell lines with numeric compound activities
were not satisfactory, indicating that a satisfactory QSAR model
for each single cell line individually was hard to obtain.
In contrast, in our current collaborative QSAR
modeling, performance against all the cell lines was
improved. The significant improvement margin evaluated
by R-square indicates that our CMF based QSAR
modeling successfully captures the correlation within the
dataset, rather than only the absolute differences
evaluated by RMSE.

Besides the average RMSE and
R-square of the different QSAR models, we also investigated
their error distributions under the diverse subset partition
strategy to give a more rigorous comparison of their
performance. It can be seen from the boxplots of the squared
error (Figures 6-7) that for both compound descriptions,
collaborative QSAR modeling achieved the lowest
error means and low variances compared to the other two
baselines, indicating the best prediction ability among all
methods.

It should be noted that in our previous study we found
that different cell lines perform differently when modeling
the inhibitor affinity with linear regression or
SVR. In particular, only the data of NCI-H446 could produce
a reasonable model by QSAR analysis, probably
because the other three cell lines may be less
sensitive than NCI-H446 cells to the hedgehog signaling
inhibitor [7]. Nevertheless, it can be seen from Figures 4-7
that if we combine the data from the different cell lines
under the CMF based QSAR modeling, we can
greatly reduce such non-specific effects in the cell lines,
and obtain reasonable QSAR models for all
the cell lines. This improvement is attributed
to the fact that the collaborative filtering based
framework allows the data from different cell lines to enhance
each other during the training process, which
eventually makes the efficacy modeling better than
modeling the datasets separately. We believe that
such "collaborative" scenarios for drug analysis will become
more popular in the future, as more and more cell
lines become available and drugs are often required to be
investigated under various circumstances.

Figure 4 Comparison of CMF based QSAR modeling with ridge regression and SVR on the inhibitors of Hedgehog Signaling Pathway.

The compounds are represented with Drug-like index and General Descriptor respectively. The whole data was partitioned randomly,
selecting 2/3 of the data as the training dataset and 1/3 as the testing dataset. The whole procedure was repeated 10 times and the averaged
performance was calculated. Cell line No. 1-4 correspond to BxPC-3, NCI-H446, SW1990 and NCI-H157 respectively.

Finally, in order to evaluate whether our collaborative
QSAR model is general enough for new predictions, we
also checked the domain of application (DOA) of the
model under the diverse subset partition strategy. The
DOA is used to estimate the reliability
of the prediction for a new compound [24] by a
specific method. Molecules that fall outside the domain
may lead to unreliable predictions [10]. In the analysis of
DOA, a leverage value $h_i$ is defined in equation (11)
for each chemical molecule:

$$h_i = x_i^T (X^T X)^{-1} x_i \qquad (11)$$

where $x_i$ is the descriptor vector of the query
compound, and $X$ is the $n \times k$ matrix containing $k$ descriptor
values for $n$ training samples. The superscript $T$ denotes the
transpose of the matrix or vector. Generally, the warning
leverage $h^*$ is fixed at $3k/n$, where $n$ is the number of
training compounds and $k$ is the number of descriptors.
When the leverage is greater than the warning leverage
$h^*$, the predicted activity is the result of substantial
extrapolation of the model and, therefore, may not be
reliable and may reflect over-fitting.
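
The leverage computation can be sketched as follows. The sketch follows the leverage definition and the 3k/n warning threshold stated above; using the pseudo-inverse instead of a plain inverse is an assumption made here for numerical safety when descriptors are collinear, and the function names and synthetic descriptors are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^{-1} x_i for each row x_i of X_query."""
    gram_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, gram_inv, X_query)

def warning_leverage(X_train):
    """Warning threshold h* = 3k / n."""
    n, k = X_train.shape
    return 3.0 * k / n

# Synthetic descriptors: 20 training compounds, 3 descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
h_train = leverages(X_train, X_train)
h_star = warning_leverage(X_train)
# A query far from the training data is flagged as an extrapolation.
h_far = leverages(X_train, np.full((1, 3), 100.0))[0]
print(h_star, h_train.max(), h_far > h_star)
```

For a full-rank training matrix the training leverages sum to k, the number of descriptors, which gives a quick sanity check on the computation.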

Based on the definition of leverage, Williams plot was 
used in this study to visualize the DOA of the QSAR model 
[10]. The Williams plot plots the standardized cross- 
validated residuals (RES) versus leverage values (h) } and can 
be used to obtain an immediate and simple graphical detec- 
tion of both the response outliers (Y outliers) and the struc- 
turally influential chemicals (X outliers) of a model. 
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Figure 5 Comparison of CMF based QSAR modeling with ridge regression and SVR on the inhibitors of Hedgehog Signaling Pathway. 

The compounds are represented with Drug-like index and General Descriptor respectively. The whole data was partitioned with Diverse subset 
method. Cell line No. 1-4 corresponding to BxPC-3, NCI-H446, SW1990 and NCI-HI 57 respectively. 



Generally, points whose Y-axis values fall outside the
$3\sigma$ lines ($\sigma$ is the standard residual unit of the
compounds) can be considered Y outliers, while points whose
X-axis values fall beyond the warning leverage $h^*$ line can
be considered X outliers [10]. Figures 8 and 9 show the
Williams plots for the four cell lines with the compound
representations of General Descriptor and Drug-like index,
respectively. It can be seen that for all four cell lines,
most of the compounds fall into the corresponding application
domain, which indicates that the collaborative QSAR modeling
has achieved reliable activity predictions for the compounds
within a well-defined domain of applicability.
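
The two outlier rules of the Williams plot can be written down directly. The sketch below is illustrative only: the residuals and leverages are made-up numbers, the helper name `williams_flags` is hypothetical, and the residuals are standardized by dividing by their sample standard deviation, which is an assumption rather than the exact procedure of this study.

```python
import numpy as np

def williams_flags(residuals, leverage, n_descriptors, n_train, sigma_cut=3.0):
    """Flag Y outliers (|standardized residual| > 3) and X outliers (h > h*)."""
    residuals = np.asarray(residuals, dtype=float)
    leverage = np.asarray(leverage, dtype=float)
    # Assumed standardization: scale by the sample standard deviation.
    std_res = residuals / residuals.std(ddof=1)
    h_star = 3.0 * n_descriptors / n_train
    y_outlier = np.abs(std_res) > sigma_cut
    x_outlier = leverage > h_star
    return std_res, h_star, y_outlier, x_outlier

# Toy data: 20 compounds, one with a large residual, one with a large leverage.
residuals = np.array([0.1, -0.1] * 10)
residuals[19] = 5.0
leverage = np.full(20, 0.1)
leverage[0] = 0.9

std_res, h_star, y_out, x_out = williams_flags(
    residuals, leverage, n_descriptors=2, n_train=20
)
print("Y outliers:", np.flatnonzero(y_out), "X outliers:", np.flatnonzero(x_out))
```

Plotting `std_res` against `leverage`, with horizontal lines at plus and minus 3 and a vertical line at `h_star`, gives the Williams plot itself.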

Impact of the Regularization Parameters 

In this subsection, we investigate the impact of
the regularization parameters on our CMF-based
QSAR modeling. We varied the values of $\lambda_1$ and
$\lambda_2$ under different dimensionalities of the
low-dimensional representations and different numbers of
training ratings, and plotted the RMSE over the data of all
four cell lines, as shown in Figure 10. The tests were
performed with both compound descriptions, i.e., General
Descriptors and Drug-Like Index. In the figure, the x-axis
corresponds to the value of the regularization parameter
(0.001, 0.01, 0.1, 1, 10, 100), while the y-axis corresponds
to the training ratio for the whole QSAR data
(15 %, 35 %, 55 %, 75 %). It can be seen that (1) the
influence of the regularization parameter on the performance
is small, indicating that our proposed method is robust and
insensitive to these parameters; (2) higher performance is
achieved with a larger number of training samples, which is
not surprising; and (3) the two compound descriptions, i.e.,
Drug-Like Index and General Descriptor, generally performed
the same in CMF, with no statistically significant difference.
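
The experimental protocol described above (a grid over regularization values and training ratios, scored by RMSE on the held-out part) can be sketched as follows. This is not the CMF model: closed-form ridge regression is swapped in as a simple stand-in, the data are synthetic, and the names `ridge_fit`, `rmse` and `sweep` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression, used here only as a stand-in model."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def sweep(X, y, lambdas, ratios, seed=0):
    """Return a (len(ratios), len(lambdas)) table of held-out RMSE values."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    table = np.zeros((len(ratios), len(lambdas)))
    for r, ratio in enumerate(ratios):
        n_train = max(1, int(round(ratio * len(y))))
        train, test = order[:n_train], order[n_train:]
        for c, lam in enumerate(lambdas):
            w = ridge_fit(X[train], y[train], lam)
            table[r, c] = rmse(y[test], X[test] @ w)
    return table

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]   # grid quoted in the text
ratios = [0.15, 0.35, 0.55, 0.75]          # training ratios quoted in the text

# Synthetic regression data in place of the real activity data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

rmse_table = sweep(X, y, lambdas, ratios)
print(np.round(rmse_table, 3))
```

Each row of `rmse_table` corresponds to one training ratio and each column to one regularization value, which mirrors the layout of the axes described for Figure 10.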

Feature selection based on CMF for compound 
description among multiple cell lines 

Using the collaborative filtering based feature selection
strategy proposed above, we obtained the intra-cell line
and inter-cell line feature weightings for the inhibitors
of Hedgehog Signaling Pathway. The former can be used to
uncover the features that are important in inhibitor design
against a specific cell line, while the latter is used to
identify common features that are important for inhibitors
against multiple cell lines simultaneously. We compared
these two kinds of feature weighting to provide useful clues
for modifying inhibitors and improving their affinities.

In this feature selection, we used the Drug-like index,
with a total of 28 features, to represent each compound,
since its biological meaning is easy to interpret. The
General Descriptor feature space has been hybridized, so the
original structural meaning of its features could not be
kept. Therefore, GD was not adopted for feature weighting
here. Generally speaking, the Drug-like index belongs to the
category of structural descriptors. Structural descriptors
can correlate with each other, and some of them may be
redundant. However, if they have different and significant
distributions in the considered drug class, they can be used
for drug-knowledge extraction and the redundant ones can be
ignored. In our study, the descriptors maintain their
identity and clearly interpretable structural significance
throughout the process. A table with detailed descriptions
of each drug-like index is given in Additional file 2:
Table S1.

The 28 feature weights for the intra-cell line, the
inter-cell line and the final integrated weighting are shown
in Figures 11, 12 and 13. In all three figures, the x-axis
represents the Drug-like index feature ID and the y-axis
represents the corresponding weight. It can be seen that the
final integrated feature weighting is different from that
of the intra-cell line. Moreover, the inter-cell line feature
weighting can be viewed as an efficient way to identify the
features that are potentially important for multi-target
inhibitors of Hedgehog Signaling Pathway. Based on these
figures, we offer the following insights on inhibitor design:

1) As shown in Figure 11, the features '# of non-H'
(DLI1), '# of non-H polar bonds' (DLI5) and '# of
2-degree cyclic atoms' (DLI13) were ranked in the top 3.
These findings indicate that the volume of the molecule,
the polarity of the molecule and the cyclic degree of the
molecule are the most important features for the
design of multi-target inhibitors of Hedgehog Signaling
Pathway. Our findings are consistent with the
empirical rules for lead compound optimization, which
use these three elements to determine activity.
2) We can see from Figure 11 that the feature '# of cap
fragments' (DLI23) was also important when
multi-cell line inhibitors were designed. This is
consistent with the empirical rule of changing
the substituent group (functional group) in order to
improve inhibitor activity. However, compared
with Figure 12, the importance of this feature for
multi-cell line inhibitor design is not as significant
as that for individual cell lines. A possible
explanation is that, although this feature is
important for inhibitors of each individual cell line,
the directions in which it improves activity may be
inconsistent across cell lines, which reduces its
importance when multiple cell lines are considered.
3) All three figures (Figures 11, 12 and 13) show that
the weight of the feature '# of 3-level bonding
patterns' (DLI18) is 0. This is probably due to the
following two reasons: a) all compound samples in
our study lack this feature, and b) this feature
is not considered in most in-silico compound
optimizations.

Discussion 

Comparison of CMF based QSAR modeling with other 
collaborative QSAR modeling 

Although CMF based QSAR modeling was investigated
in our study, we are aware of other QSAR modeling
approaches that integrate information from multiple sources.
We call such models "collaborative" QSAR modeling; examples
include the neural network based [15] [25-27] and multi-task
learning based [9] [10] models, as well as proteochemometric
modeling (PCM) [28] [29]. In order to further uncover the
characteristics of such collaborative QSAR modeling, we
compare our CMF based method with the aforementioned
methods on our multiple cell line QSAR modeling for the
inhibitors of Hedgehog Signaling Pathway.

Neural network based collaborative QSAR modeling 

As mentioned above, Erhan et al. proposed a neural
network based collaborative QSAR modeling approach for drug
discovery [15]. This is one of the first attempts to
construct an efficient procedure for integrating information
on multiple drug targets at a time by extending standard
multi-layer neural networks. Neural networks provide an
ideal test bed for implementing collaborative QSAR modeling:
the simplest such form is to create a shared hidden layer
that is trained in parallel for all the learning tasks. In
this case, the training procedure is carried out on all the
tasks (in our study, the QSAR data of all the cell lines) in
parallel. Because the structure of the network includes a
shared layer (weight matrix), it is possible for so-called
"shared internal representations" to develop and to be learned.
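
A shared hidden layer of this kind can be sketched with plain NumPy. The example is a minimal illustration, not the network of [15] or the one tested in this study: the data are synthetic, the layer size, learning rate and tanh activation are assumptions, and `train_shared_net` and `predict` are hypothetical names.

```python
import numpy as np

def train_shared_net(X, Y, n_hidden=8, lr=0.05, epochs=500, seed=0):
    """One hidden layer shared by all tasks, trained by back propagation.

    X: (n, d) descriptors. Y: (n, m) activities, one column per task (cell line).
    """
    rng = np.random.default_rng(seed)
    d, m = X.shape[1], Y.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))  # shared input-to-hidden weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, m))  # one output column per task
    b2 = np.zeros(m)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)         # shared internal representation
        P = H @ W2 + b2                  # predictions for all tasks at once
        err = P - Y
        losses.append(float(np.mean(err ** 2)))
        grad_P = 2.0 * err / err.size    # gradient of the mean squared error
        grad_W2 = H.T @ grad_P
        grad_b2 = grad_P.sum(axis=0)
        grad_Z = (grad_P @ W2.T) * (1.0 - H ** 2)  # back through tanh
        grad_W1 = X.T @ grad_Z
        grad_b1 = grad_Z.sum(axis=0)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses

def predict(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Synthetic data: 40 compounds, 6 descriptors, 4 tasks standing in for cell lines.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
Y = np.tanh(X @ rng.normal(scale=0.5, size=(6, 3))) @ rng.normal(size=(3, 4))

params, losses = train_shared_net(X, Y)
print("first loss:", round(losses[0], 4), "last loss:", round(losses[-1], 4))
```

Because every output column reads from the same hidden activations `H`, the gradient of each task updates `W1`, which is how information is shared across the tasks.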

Specifically, in our multiple cell line QSAR modeling
for the inhibitors of Hedgehog Signaling Pathway, we
used a 10-fold cross-validation scheme to test the data
from the 4 cell lines in this neural network model. The
weights from the input layer to the hidden layer, as well
as from the hidden layer to the output layer, are learned
through the back propagation (BP) algorithm. Our in-house
test indicated that the neural network based collaborative
QSAR modeling was comparable to CMF based QSAR modeling and,
not surprisingly, better than the single QSAR modeling
(results are not shown here).

Multi-task learning based collaborative QSAR modeling 

A neural network can be viewed as a specific form of
multi-task learning. Multi-task learning has been developed
for situations where multiple related learning
tasks are to be accomplished together. When explicit or
hidden interrelationships among the tasks can be
exploited [9] [10], multi-task learning is more effective
than learning each task independently. The intuition
underlying the framework is that multiple related
tasks can benefit each other by sharing data and features
across the tasks, thus boosting the learning
performance of each single task [30]. It also provides an
efficient mechanism for cross-task feature selection, thus
uncovering the common dominant features for all the
tasks simultaneously. Our group has successfully applied
multi-task learning in QSAR modeling, specifically in
studies of HIV and HCV inhibitors [9] [10].

Figure 11 Feature weighting for inter-cell line of drug-like index.

Assume that the data set contains N tuples
$z_i = (x_i, y_i, k_i)$ for $i \in \{1, \ldots, N\}$, where
$x_i \in \mathbb{R}^{d}$ is the drug descriptor and
$k_i \in \{1, \ldots, M\}$ is the task indicator of the
example $(x_i, y_i)$. The M tasks correspond to M different
cell lines or drug targets. A critical issue in this
collaborative QSAR modeling is to learn a set of sparse
functions across these tasks for drug activity regression.
This is commonly achieved by learning M linear regressions
of the form $w_k^{T} x$, for which the following square loss
function is adopted (other loss functions can also be applied):



$$\ell(z, W) = \left(y - w_k^{T} x\right)^{2} \qquad (11)$$



where $z = (x, y, k)$, $W = [w_1, w_2, \ldots, w_M] \in \mathbb{R}^{d \times M}$, and
$W^{j}$ denotes the j-th row of $W$.

In the multi-task learning framework, W can be optimized
by enforcing joint sparsity across the different tasks. This
is done by adding a norm of the matrix W (different norms
can be used) to the square loss function, which leads to
only a few non-zero rows of W.
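
One common way to obtain such row sparsity is to add the L2,1 norm of W (the sum of the Euclidean norms of its rows) to the square loss and to minimize the sum by proximal gradient descent. The sketch below illustrates that idea under stated assumptions: the choice of the L2,1 norm, the solver, the synthetic data and the names `l21_norm`, `prox_l21` and `fit_multitask` are all illustrative and are not taken from [9] or [10].

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def prox_l21(W, t):
    """Row-wise soft thresholding: shrink the norm of every row by t."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def fit_multitask(X, y, task, n_tasks, lam=0.1, n_iter=500):
    """Minimize (1/n) * sum_i (y_i - w_{k_i}^T x_i)^2 + lam * ||W||_{2,1}."""
    n, d = X.shape
    W = np.zeros((d, n_tasks))
    # Step size from an upper bound on the Lipschitz constant of the gradient.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        pred = np.einsum("ij,ij->i", X, W[:, task].T)  # w_{k_i}^T x_i per sample
        resid = pred - y
        G = np.zeros_like(W)
        for m in range(n_tasks):
            mask = task == m
            G[:, m] = 2.0 * X[mask].T @ resid[mask] / n
        W = prox_l21(W - step * G, step * lam)
    return W

# Synthetic data: 6 descriptors, 3 tasks, only descriptors 0 and 1 matter.
rng = np.random.default_rng(0)
n, d, n_tasks = 120, 6, 3
X = rng.normal(size=(n, d))
task = np.arange(n) % n_tasks               # task indicator k_i of each sample
W_true = np.zeros((d, n_tasks))
W_true[0] = [1.0, 0.8, 1.2]
W_true[1] = [-1.0, -0.7, -0.9]
y = np.einsum("ij,ij->i", X, W_true[:, task].T) + 0.05 * rng.normal(size=n)

W_hat = fit_multitask(X, y, task, n_tasks, lam=0.1)
row_norms = np.linalg.norm(W_hat, axis=1)
print(np.round(row_norms, 3))
```

Rows of `W_hat` whose norm is zero correspond to descriptors discarded for all tasks at once, which is the cross-task feature selection effect described in the text.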

The relationship between collaborative filtering and
multi-task learning has been discussed in previous studies
[30]. The multi-task learning model is closely related to
multiple response regression models [31]. Multiple
response regression is the task of estimating several
response variables using a common set of input variables.
In general, both multi-task learning and multiple response
regression can be used to find the correlation between
different tasks, and thus improve on single task learning.
Such an approach has many potential applications in
various areas; interested readers are referred to [31].
It should be noted that in the multi-task learning
framework, the samples for different tasks should
not be identical. In general, the less the samples overlap
across different tasks, the greater the prediction
ability for each task. This idea is related to another
interesting approach, transfer learning [32]; multi-task
learning can be categorized into this area, and the
information in different tasks is expected to "transfer"
between them to boost the performance of each individual task.

For the particular data in our multiple cell line QSAR
modeling for the inhibitors of Hedgehog Signaling Pathway,
the drug samples for all the cell lines are identical, so
it is unnecessary to use multi-task learning in the
collaborative QSAR here. Nevertheless, if non-identical
samples exist for multiple cell lines, multi-task learning
will be a good choice for collaborative QSAR modeling that
integrates different data sources.

Proteochemometric Modeling 

Proteochemometric modeling (PCM) is based on the
similarity of a group of ligands and a group of
targets, to the extent that PCM models the so-called
ligand-target interaction space [28] [29]. Like QSAR
modeling, a PCM model is constructed from chemical
descriptors that describe the compound data set, and it
introduces an additional term, a descriptor of
the protein-target interaction (Figure 14). Therefore, a
PCM model is built on both ligand and target similarity,
and it can be regarded as an extension of conventional
QSAR modeling that models the relationship between
multiple compounds and targets simultaneously. Among all
the approaches discussed, PCM is intrinsically the most
similar to our collaborative filtering based QSAR modeling.
PCM explicitly requires the target information as well as
the protein-target interaction descriptions, whereas in our
collaborative filtering based QSAR modeling these two kinds
of information are implicitly embedded in one computational
scheme. From this point of view, our model is more
flexible and extendable. Since no explicit target
information is available in our multiple cell line QSAR
modeling for the inhibitors of Hedgehog Signaling Pathway,
we cannot use PCM for the QSAR modeling. A large-scale
ligand-target relationship study and a comparison between
collaborative filtering based methods and PCM remain an
interesting and useful topic for future study.
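
To make the contrast with PCM concrete, the sketch below builds the kind of explicit input a PCM-style model would need: one feature row per ligand-target pair, made of ligand descriptors, target descriptors and an interaction block. Using element-wise cross-terms as the interaction descriptor is an assumption chosen for illustration, the numbers are made up, and `pcm_features` is a hypothetical helper.

```python
import numpy as np

def pcm_features(ligand_desc, target_desc):
    """Return one row per (ligand, target) pair:
    [ligand descriptors, target descriptors, ligand x target cross-terms]."""
    rows, pairs = [], []
    for i in range(ligand_desc.shape[0]):
        for j in range(target_desc.shape[0]):
            # Assumed interaction descriptor: all pairwise products.
            cross = np.outer(ligand_desc[i], target_desc[j]).ravel()
            rows.append(np.concatenate([ligand_desc[i], target_desc[j], cross]))
            pairs.append((i, j))
    return np.vstack(rows), pairs

ligands = np.array([[1.0, 2.0, 3.0],
                    [0.5, 0.0, 1.0]])   # 2 ligands, 3 descriptors each
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])        # 2 targets, 2 descriptors each

F, pairs = pcm_features(ligands, targets)
print(F.shape)
```

The target block is exactly the piece that is unavailable for the cell line data in this study, whereas a collaborative filtering model only needs the compound by cell line activity matrix.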

Conclusions 

In this study, an efficient collaborative QSAR model for
inhibitors of the Hedgehog Signaling Pathway from multiple
cell lines was proposed. The model is derived from the area
of information retrieval in social networks, i.e.,
collaborative filtering, and its performance is well
demonstrated and explained in our study. By applying this
elegant computational model, we successfully addressed two
issues that remained from our previous study. (1) The
information among multiple cell lines can be integrated to
boost the QSAR results, rather than modeling each cell line
separately. Our extensive experiments indicated that the
performance is remarkable compared to other single cell
line QSAR methods. (2) A novel feature selection strategy
for such a collaborative environment was proposed, which can
be used to derive the commonly important features related to
all the given cell lines, while at the same time presenting
their specific contributions to a specific cell line. Based
on the results of feature selection, we presented several
chemical modifications that are likely to improve the
compound affinity towards multiple targets in the Hedgehog
Signaling Pathway simultaneously. In summary, our study
provides useful clues for multiple cell line/target QSAR
modeling when cell line or target information within a
related pathway exists. The proposed collaborative model
with its feature selection strategy is efficient, robust,
flexible, and can be easily extended to model large-scale
multiple cell line/QSAR data.

Additional files 



Additional file 1: The data and scripts for collaborative QSAR